Engineering posts about Service Level Objectives

Curated summaries and key learnings for engineers working with Service Level Objectives.

Cloudflare
11m

Code Orange: Fail Small is complete. The result is a stronger Cloudflare network

The article outlines the completion of Cloudflare's 'Code Orange: Fail Small' initiative, aimed at enhancing the resilience and reliability of its network infrastructure. Key improvements include the...

DigitalOcean
13m

From Incident Counting to SLIs: How DigitalOcean Rethought Availability

The article discusses DigitalOcean's transition from an incident-counting methodology to a more nuanced SLI-based approach for measuring availability. Initially, the company relied on a simplistic...

Cloudflare
8m

A one-line Kubernetes fix that saved 600 hours a year

The article discusses a critical performance issue encountered when running Atlantis, the tool used to manage Terraform changes, on Kubernetes. The problem stemmed from slow restarts due to a default behavior...

GitHub
6m

When protections outlive their purpose: A lesson on managing defense systems at scale

The article outlines the challenges GitHub faces in managing defense mechanisms that protect the platform from abuse without adversely affecting legitimate users. It highlights the...

Cloudflare
11m

Code Orange: Fail Small — Our resilience plan following recent incidents

The article outlines Cloudflare's 'Code Orange: Fail Small' initiative aimed at enhancing the resilience of its network following significant outages. It details the incidents that led to the plan,...

Atlassian
15m

Pull request intervention for infrastructure-as-code risks with Bitbucket custom merge checks

The article discusses Atlassian's approach to mitigating risks associated with infrastructure-as-code through the implementation of Bitbucket custom merge checks. It highlights the importance of...

Cloudflare
7m

Cloudflare outage on December 5, 2025

On December 5, 2025, a configuration change related to Cloudflare's Web Application Firewall (WAF) caused a significant outage affecting a portion of its network. The incident, which...